Visualizing Categorical Distributions

Reading in the data

import altair as alt
import pandas as pd

movies_extended = pd.read_csv("../../data/movies-extended.csv")
movies_extended
Title US Gross Worldwide Gross US DVD Sales ... Director Rotten Tomatoes Rating IMDB Rating IMDB Votes
0 Boynton Beach Club 3127472.0 3127472.0 NaN ... NaN NaN NaN NaN
1 Broken Arrow 70645997.0 148345997.0 NaN ... John Woo 55.0 5.8 33584.0
2 Brazil 9929135.0 9929135.0 NaN ... Terry Gilliam 98.0 8.0 76635.0
... ... ... ... ... ... ... ... ... ...
1187 Zodiac 33080084.0 83080084.0 20983030.0 ... David Fincher 89.0 NaN NaN
1188 The Legend of Zorro 45575336.0 141475336.0 NaN ... Martin Campbell 26.0 5.7 21161.0
1189 The Mask of Zorro 93828745.0 233700000.0 NaN ... Martin Campbell 82.0 6.7 4789.0

1190 rows × 16 columns

Bar charts are effective for visualizing the distribution of a single categorical column

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()'),
    alt.Y('Major Genre', sort='x'))
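
As a rough cross-check, the bar lengths produced by `count()` are just the number of rows per category, which can also be tabulated in pandas. The sketch below uses a small made-up data frame (`toy` is hypothetical and only stands in for `movies_extended`).

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    "Major Genre": ["Drama", "Comedy", "Drama", "Action", "Drama", "Comedy", None],
})

# value_counts() skips missing values by default and sorts from largest to smallest
genre_counts = toy["Major Genre"].value_counts()
print(genre_counts)
```

Note that `value_counts()` lists the largest category first, whereas `sort='x'` in the chart orders the bars from the smallest count to the largest.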

Stacked bar charts can visualize counts for combinations of two categorical columns

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()'),
    alt.Y('Major Genre', sort='x'),
    alt.Color('MPAA Rating'))

Reordering the bar segments aligns them with the order in the legend

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()'),
    alt.Y('Major Genre', sort='x'),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))

Rescaling the bar lengths facilitates comparing proportions between bars

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', stack='normalize', title='Proportion of movies'),
    alt.Y('Major Genre', sort='x'),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))
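
To make the rescaling concrete: `stack='normalize'` divides each segment by the total of its bar, so every bar sums to 1. A minimal pandas sketch of the same arithmetic, using a hypothetical `toy` data frame rather than the real movie data, could look like this.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    "Major Genre": ["Drama", "Drama", "Drama", "Drama", "Comedy", "Comedy"],
    "MPAA Rating": ["R", "R", "PG", "PG-13", "PG", "R"],
})

# normalize="index" divides each row (genre) by its own total,
# so the proportions within a genre sum to 1
proportions = pd.crosstab(toy["Major Genre"], toy["MPAA Rating"], normalize="index")
print(proportions)
```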

Sorting by the length of one of the coloured segments makes the chart easier to read

sort_order = ['Adventure', 'Musical', 'Comedy', 'Romantic Comedy', 'Action',
              'Drama', 'Concert/Performance', 'Documentary', 'Western',
              'Thriller/Suspense', 'Horror', 'Black Comedy'] 
alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', stack='normalize', title='Proportion of movies'),
    alt.Y('Major Genre', sort=sort_order),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))
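
The sort order above is typed out by hand. One possible way to derive such a list is to compute the per-genre proportion of a single rating and sort by it. This is only a sketch under assumptions: the `toy` data frame is made up, and the choice of `'G'` as `target_rating` is illustrative, since the slide does not state which segment the list was sorted by.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    "Major Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy",
                    "Adventure", "Adventure", "Adventure", "Adventure"],
    "MPAA Rating": ["R", "R", "G", "G", "R", "G", "G", "G", "R"],
})

target_rating = "G"  # illustrative choice of the segment to sort by

# Proportion of each rating within each genre (rows sum to 1)
proportions = pd.crosstab(toy["Major Genre"], toy["MPAA Rating"], normalize="index")

# Genres ordered from the largest to the smallest share of the target rating
sort_order = proportions[target_rating].sort_values(ascending=False).index.tolist()
print(sort_order)
```

A list built this way can be passed to the chart in the same manner as the hand-written one, via `alt.Y('Major Genre', sort=sort_order)`.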

Normalized stacked bar charts are effective at visualizing just a few categories

sort_order = ['Concert/Performance', 'Musical', 'Documentary', 'Adventure',
              'Comedy', 'Romantic Comedy', 'Drama', 'Action']
# Keep only the movies rated G or PG
g_pg_movies = movies_extended[movies_extended['MPAA Rating'].isin(['G', 'PG'])]
alt.Chart(g_pg_movies).mark_bar().encode(
    alt.X('count()', stack='normalize', title='Proportion of movies'),
    alt.Y('Major Genre', sort=sort_order),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))

Showing bars side by side makes it easier to compare their exact lengths within a category

(alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', title=''),
    alt.Y('MPAA Rating', title=''),
    alt.Color('MPAA Rating', legend=None))
 .properties(width=100, height=45)
 .facet('Major Genre', columns=4)
 .resolve_scale(x='independent'))

Swapping the facet and y columns targets the plot at a slightly different question

(alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', title=''),
    alt.Y('Major Genre', title='', sort='x'),
    alt.Color('MPAA Rating', legend=None))
 .properties(width=100, height=150)
 .facet('MPAA Rating')
 .resolve_scale(x='independent'))

Heatmaps are effective for visualizing counts of two-dimensional categorical data

alt.Chart(movies_extended).mark_rect().encode(
    alt.Color('count()'),
    alt.X('MPAA Rating'),
    alt.Y('Major Genre', sort='color'))
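
The numbers a heatmap like this encodes as colour form a two-way table of counts. As a hedged illustration with a hypothetical `toy` data frame (not the real movie data), the same table can be built in pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    "Major Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Action"],
    "MPAA Rating": ["R", "R", "PG-13", "PG", "R", "PG-13"],
})

# One row per genre, one column per rating, each cell holds a row count;
# combinations that never occur are filled with 0
counts = pd.crosstab(toy["Major Genre"], toy["MPAA Rating"])
print(counts)
```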

Using both the colour and marker size to indicate the count creates a more effective visualization

alt.Chart(movies_extended).mark_circle().encode(
    alt.X('MPAA Rating'),
    alt.Y('Major Genre', sort='color'),
    alt.Color('count()'),
    alt.Size('count()'))

Let’s apply what we learned!